Abstract: The multi-armed bandit problem and one of its
most  interesting  extensions,  the  restless  bandits  problem,  are 
frequently  encountered  in  various  stochastic  control  problems. 
We  present  a  linear  programming  relaxation  for  the  restless 
bandits  problem  with  discounted  rewards,  where  only  one 
project  can  be  activated  at  each  period  but  with  additional 
costs  penalizing  switching  between  projects.  The  relaxation 
can  be  efficiently  computed  and  provides  a  bound  on  the 
achievable  performance.  We  describe  several  heuristic  policies; 
in particular, we show that a policy adapted from the primal-dual
heuristic of Bertsimas and Niño-Mora [1] for the classical
restless  bandits  problem  is  in  fact  equivalent  to  a  one-step 
lookahead  policy;  thus,  the  linear  programming  relaxation 
provides  a  means  to  compute  an  approximation  of  the  cost-to- 
go.  Moreover,  the  approximate  cost-to-go  is  decomposable  by 
project,  and  this  allows  the  one-step  lookahead  policy  to  take 
the  form  of  an  index  policy,  which  can  be  computed  on-line 
very  efficiently.  We  present  numerical  experiments,  for  which 
we  assess  the  quality  of  the  heuristics  using  the  performance 
bound. 

I. Introduction

In  the  field  of  stochastic  optimization,  the  multi-armed 
bandit  (MAB)  model  is  of  fundamental  importance  because 
it  is  known  to  be  solvable  efficiently  despite  its  generality. 

In  the  MAB  problem,  we  consider  N  projects,  of  which  only 
one  can  be  worked  on  at  any  time  period.  Each  project 
$i$ is characterized at (discrete) time $t$ by its state $x_i(t)$. If
project $i$ is worked on at time $t$, one receives a reward
$\alpha^t r(x_i(t))$, where $\alpha \in (0,1)$ is a discount factor. The state
$x_i(t)$ then evolves to a new state according to given transition
probabilities.  The  states  of  all  idle  projects  are  unaffected. 
We  assume  perfect  state  information  and  (in  this  paper)  we 
consider  only  a  finite  number  of  states  for  each  project.  The 
goal  is  to  find  a  policy  which  decides  at  each  time  period 
which  project  to  work  on  in  order  to  maximize  the  expected 
sum  of  the  discounted  rewards  over  an  infinite  horizon. 

The  MAB  problem  was  first  solved  by  Gittins  [2],  [3],  who 
showed  that  it  is  possible  to  attach  to  each  project  an  index 
that  is  a  function  of  the  project  and  of  its  state  alone,  and 
that  the  optimal  policy  is  simply  characterized  by  operating 
at  each  period  the  project  with  the  greatest  current  index. 
Moreover,  these  indices  can  be  efficiently  calculated.  Whittle 
[4]  proposed  an  extension  of  the  model,  called  the  restless 
bandits problem (RB), in which one can activate several
projects  at  each  time  period,  and  the  projects  that  remain 
passive  continue  to  evolve,  possibly  using  different  rules. 
Finding  an  optimal  policy  efficiently  for  the  RB  problem 
is  unlikely  to  be  possible  however,  since  the  problem  was 
shown  to  be  PSPACE-hard  [5],  even  in  the  case  when 
only  one  project  is  active  at  each  period  and  deterministic 
transition  rules  are  in  effect. 

Another extension of the MAB model concerns the addition
of costs for changing the currently active project. This
problem  is  of  great  interest  to  various  applications  where 
the  MAB  formulation  also  applies,  in  order  to  model  for 
example  set-up  and  tear-down  costs  in  queuing  networks  [6], 
or  transition  costs  in  a  job  search  problem  (see  the  survey  in 
[7]).  This  paper  is  motivated  by  an  optimal  search  problem 
in  the  context  of  aerial  surveillance. 

Compared  with  the  classical  MAB  problem,  relatively  few 
papers  are  devoted  to  RB  problems  and  multi-armed  bandit 
problems  with  switching  costs  (MABSC).  In  the  same  spirit 
as  this  paper,  [1]  presented  relaxations  providing  bounds 
on  the  achievable  performance  for  the  RB  problem.  [8] 
and  [9]  study  the  indexability  of  the  RB  problem  and  the 
MABSC  problem  respectively.  However  in  the  latter  case 
the  switching  costs  must  be  decomposable  as  set-up  and 
tear-down  costs,  which  is  not  valid  in  general  if  the  costs 
represent  travel  distances.  Other  contributions  to  the  MABSC 
problem  include  [10]  and  [11].  [12]  solves  a  two-armed 
bandit  problem  with  switching  costs  analytically,  in  the  case 
of  deteriorating  rewards. 

In  this  paper,  we  consider  the  restless  bandits  problem 
with  switching  costs  (RBSC),  when  only  one  project  can  be 
set  active  at  each  time  period.  In  Section  II,  we  formulate 
the  RBSC  problem  in  the  general  framework  of  Markov 
decision  processes  (MDP).  We  also  briefly  review  the  state- 
action  frequency  approach  used  here  to  obtain  the  linear 
programming  formulation  of  MDPs.  In  Section  III,  we  show 
that  even  the  special  case  of  the  MABSC  is  NP-hard,  and  we 
propose  a  first-order  linear  relaxation  of  the  RBSC  problem, 
providing  an  efficiently  computable  bound  on  the  achievable 
performance.  Section  IV  describes  heuristics  to  solve  the 
problem in practice, and finally, Section V presents numerical
experiments  comparing  these  heuristics. 


II. Exact Formulation of the RBSC
A.  The  State-Action  Frequency  Approach  to  MDPs 

In  this  section,  we  first  review  the  linear  programming 
approach  based  on  occupation  measures  to  formulate  Markov 
decision processes (MDP). The RBSC problem is then formulated
in this framework. A (discrete-time) MDP is defined
by a tuple $\{X, A, \mathcal{P}, r\}$ as follows:

•  X  is  the  finite  state  space. 

• $A$ is the finite set of actions. $A(x) \subset A$ is the subset
of actions available at state $x$. $\mathcal{K} = \{(x,a) : x \in X, a \in A(x)\}$
is the set of state-action pairs.

• $\mathcal{P}$ are the transition probabilities. $\mathcal{P}_{xay}$ is the probability
of moving from state $x$ to state $y$ if action $a$ is
chosen.

• $r : \mathcal{K} \to \mathbb{R}$ is an immediate reward.

We define the history at time $t$ to be the sequence of
previous states and actions, as well as the current state:
$h_t = (x_1, a_1, x_2, a_2, \ldots, x_{t-1}, a_{t-1}, x_t)$. Let $H_t$ be the set of all
possible histories of length $t$. A policy $u$ in the class of
all policies $U$ is a sequence $(u_1, u_2, \ldots)$. If the history $h_t$
is observed at time $t$, then the controller chooses an action
$a$ with probability $u_t(a|h_t)$. A policy is called a Markov
policy ($u \in U_M$) if for any $t$, $u_t$ only depends on the state
at time $t$. A stationary policy ($u \in U_S$) is a Markov policy
that does not depend on $t$. Under a stationary policy, the state
process becomes a Markov chain with transition probabilities
$P_{xy}[u] = \sum_{a \in A(x)} \mathcal{P}_{xay} u(a|x)$. Finally, a stationary policy is
a deterministic policy ($u \in U_D$) if it selects an action with
probability one. Then $u$ is identified with a map $u : X \to A$.
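As a small illustration of the formula for $P_{xy}[u]$, the following sketch builds the Markov chain induced by a randomized stationary policy. The two-state, two-action MDP and the policy are hypothetical numbers chosen only for the example, and the nested-list layout `P[x][a][y]`, `u[x][a]` is an assumption of this sketch.

```python
def induced_chain(P, u):
    """Return P_xy[u] = sum_a P[x][a][y] * u[x][a] for a stationary policy u."""
    n = len(P)
    return [
        [sum(P[x][a][y] * u[x][a] for a in range(len(P[x]))) for y in range(n)]
        for x in range(n)
    ]

# Hypothetical 2-state, 2-action MDP, indexed as P[x][a][y].
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.0, 1.0]],
]
# Hypothetical randomized stationary policy, indexed as u[x][a].
u = [[0.5, 0.5], [1.0, 0.0]]

chain = induced_chain(P, u)
for row in chain:
    print(row)
```

Each row of `chain` is a probability distribution over next states, as expected for a Markov chain.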

We fix an initial distribution $\nu$ over the initial states. In
other words, the probability that we are at state $x$ at time 1
is $\nu(x)$. If $\nu$ is concentrated on a single state $z$, we use
the Dirac notation $\nu(x) = \delta_z(x)$. Kolmogorov's extension
theorem guarantees that the initial distribution $\nu$ and any
given policy $u$ determine a unique probability measure $P_\nu^u$
over the space of trajectories of the states $X_t$ and actions $A_t$.
We denote by $E_\nu^u$ the corresponding expectation operator.

For any policy $u$ and initial distribution $\nu$, and for a
discount factor $0 < \alpha < 1$, we define

$$R_\alpha(\nu,u) = (1-\alpha) E_\nu^u \Big[ \sum_{t=1}^{\infty} \alpha^{t-1} r(X_t,A_t) \Big] = (1-\alpha) \sum_{t=1}^{\infty} \alpha^{t-1} E_\nu^u [ r(X_t,A_t) ]$$

(the exchange of limit and expectation is valid in the case
of finitely many states and actions using the dominated
convergence theorem).

An occupation measure corresponding to a policy $u$ is the
total expected discounted time spent in different state-action
pairs. More precisely, we define for any initial distribution
$\nu$, any policy $u$ and any pair $x \in X$, $a \in A(x)$:

$$f_\alpha(\nu,u;x,a) := (1-\alpha) \sum_{t=1}^{\infty} \alpha^{t-1} P_\nu^u(X_t = x, A_t = a).$$

The set $\{f_\alpha(\nu,u;x,a)\}_{x,a}$ defines a probability measure
$f_\alpha(\nu,u)$ on the space of state-action pairs that assigns
probability $f_\alpha(\nu,u;x,a)$ to the pair $(x,a)$. $f_\alpha(\nu,u)$ is called
an occupation measure and is associated to a stationary
policy $w$ defined by:

$$w(a|y) = \frac{f_\alpha(\nu,u;y,a)}{\sum_{a' \in A(y)} f_\alpha(\nu,u;y,a')}, \quad \forall y \in X, a \in A(y), \qquad (1)$$

whenever the denominator is non-zero; otherwise we can
choose $w(a|y)$ arbitrarily. We can readily check that

$$R_\alpha(\nu,u) = \sum_{x \in X} \sum_{a \in A(x)} f_\alpha(\nu,u;x,a) \, r(x,a). \qquad (2)$$
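Identity (2) can be checked numerically on a toy instance. The sketch below is only illustrative: the MDP data, the policy, the initial distribution and the truncation horizon `T` are all assumptions, and the infinite sums are truncated at `T`, where $\alpha^T$ is negligible. The occupation measure is accumulated from the state distribution at each time step, and the normalized discounted reward is computed independently by iterating the policy-evaluation recursion.

```python
ALPHA = 0.9
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]  # P[x][a][y] (hypothetical)
r = [[1.0, 0.0], [0.0, 2.0]]                                 # r[x][a]   (hypothetical)
u = [[0.5, 0.5], [1.0, 0.0]]                                 # stationary policy u[x][a]
nu = [0.3, 0.7]                                              # initial distribution
T = 2000                                                     # truncation horizon

def occupation_measure():
    d = list(nu)          # distribution of X_t, starting at t = 1
    w = 1.0 - ALPHA       # weight (1 - alpha) * alpha**(t - 1)
    f = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(T):
        for x in range(2):
            for a in range(2):
                f[x][a] += w * d[x] * u[x][a]
        d = [sum(d[x] * u[x][a] * P[x][a][y] for x in range(2) for a in range(2))
             for y in range(2)]
        w *= ALPHA
    return f

def discounted_reward():
    # Policy evaluation V = r_u + alpha * P_u V, then R = (1 - alpha) * nu . V
    V = [0.0, 0.0]
    for _ in range(T):
        V = [sum(u[x][a] * (r[x][a] + ALPHA * sum(P[x][a][y] * V[y] for y in range(2)))
                 for a in range(2)) for x in range(2)]
    return (1.0 - ALPHA) * sum(nu[x] * V[x] for x in range(2))

f = occupation_measure()
lhs = discounted_reward()
rhs = sum(f[x][a] * r[x][a] for x in range(2) for a in range(2))
print(lhs, rhs)
```

The two printed values agree up to floating-point error, and the entries of `f` sum to one, consistent with $f_\alpha(\nu,u)$ being a probability measure.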


Let $Q_\alpha(\nu)$ be the set of vectors $\rho \in \mathbb{R}^{|\mathcal{K}|}$ satisfying

$$\begin{cases} \sum_{y \in X} \sum_{a \in A(y)} \rho_{y,a} \left( \delta_x(y) - \alpha \mathcal{P}_{yax} \right) = (1-\alpha)\nu(x), & \forall x \in X, \\ \rho_{y,a} \ge 0, & \forall y \in X, a \in A(y). \end{cases} \qquad (3)$$

$Q_\alpha(\nu)$ is a closed polyhedron. Note that by summing the
first constraints over $x$ we obtain $\sum_{y,a} \rho_{y,a} = 1$, so $\rho$ satisfying
the above constraints defines a probability measure. It also
follows that $Q_\alpha(\nu)$ is bounded, i.e., is a closed polytope.
One can check that the occupation measures $f_\alpha(\nu,u)$ belong
to this polytope. In fact $Q_\alpha(\nu)$ describes exactly the set of
occupation measures achievable by all policies (see [13] for a
proof): the extreme points of the polytope $Q_\alpha(\nu)$ correspond
to deterministic policies, and each policy can be obtained
as a randomization over these deterministic policies. Thus,
one can obtain a (deterministic) optimal occupation measure
corresponding to the maximization of (2) as the solution of
a linear program over the polytope $Q_\alpha(\nu)$.
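The following sketch illustrates these statements on a toy MDP without calling an LP solver. Under assumed data (all numbers are hypothetical), it enumerates the deterministic policies, checks that each of their truncated occupation measures satisfies the equality constraints of (3), and compares the best objective value (2) among them with the optimum obtained by value iteration on the normalized reward.

```python
from itertools import product

ALPHA = 0.9
P = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.0, 1.0]]]  # P[x][a][y] (hypothetical)
R = [[1.0, 0.0], [0.0, 2.0]]                                 # r(x, a)   (hypothetical)
NU = [1.0, 0.0]                                              # initial distribution

def occupation(policy, horizon=3000):
    """Truncated occupation measure of a deterministic policy (policy[x] = action)."""
    d, w = list(NU), 1.0 - ALPHA
    f = [[0.0, 0.0] for _ in range(2)]
    for _ in range(horizon):
        for x in range(2):
            f[x][policy[x]] += w * d[x]
        d = [sum(d[x] * P[x][policy[x]][y] for x in range(2)) for y in range(2)]
        w *= ALPHA
    return f

def residual(f):
    """Largest violation of the equality constraints in (3)."""
    worst = 0.0
    for x in range(2):
        lhs = sum(f[y][a] * ((1.0 if x == y else 0.0) - ALPHA * P[y][a][x])
                  for y in range(2) for a in range(2))
        worst = max(worst, abs(lhs - (1.0 - ALPHA) * NU[x]))
    return worst

def objective(f):
    return sum(f[x][a] * R[x][a] for x in range(2) for a in range(2))

policies = list(product(range(2), repeat=2))
best = max(policies, key=lambda pol: objective(occupation(pol)))

# Value iteration on the normalized reward (1 - alpha) * r, for comparison.
V = [0.0, 0.0]
for _ in range(3000):
    V = [max((1 - ALPHA) * R[x][a] + ALPHA * sum(P[x][a][y] * V[y] for y in range(2))
             for a in range(2)) for x in range(2)]
optimal = sum(NU[x] * V[x] for x in range(2))
print(best, objective(occupation(best)), optimal)
```

Enumeration is only viable for tiny instances; it stands in here for the linear program, whose optimum is attained at one of these extreme points.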


B.  Exact  Formulation  of  the  RBSC  Problem 

In the RBSC problem, $N$ projects are distributed in space
at $N$ sites, and one server can be allocated to a chosen project
at each time period $t = 1, 2, \ldots$. In the following, we use the
terms project and site interchangeably. At each time period,
the server must occupy one site. We say that a site is active at
time $t$ if it is visited by the server, and is passive otherwise.
If the server travels from site $k$ to site $l$, we incur a cost $c_{kl}$.
Each site can be in one of a finite number of states $x_n \in S_n$,
for $n = 1, \ldots, N$, and we denote the Cartesian product of the
individual state spaces $\mathcal{S} = S_1 \times \cdots \times S_N$. If site $n$ in state
$x_n$ is visited, a reward $r^1_n(x_n)$ is earned, and its state changes
to $y_n$ according to the transition probability $p^1_{x_n y_n}$. If the site
is not visited, then a reward (potentially negative) $r^0_n(x_n)$ is
earned for that site and its state changes according to the
transition probabilities $p^0_{x_n y_n}$. We assume that all sites change
their states independently of each other. Note that if the
transition costs are all 0, we recover the initial formulation
of the RB problem [4] for one server. If additionally the
passive rewards are 0 and the passive transition matrix is the
identity matrix, we obtain the MAB problem. If we just add
the switching costs to the basic MAB problem, we call the
resulting model MABSC.

The RBSC problem can be cast into the MDP framework.
We denote the set $\{1, \ldots, N\}$ by $[N]$. The state of the system
at time $t$ is given by the state of each site and the position
$s \in [N]$ of the server. We denote the complete state by
$(x_1, \ldots, x_N; s) := (x;s)$. We can choose which site is to be
visited next; i.e., the action $a$ belongs to the set $[N]$. Once the
next site to be visited is chosen, there is a cost $c_{sa}$ for moving
to the new site, including possibly a nonzero cost for staying
at the same site. The reward earned is $r^1_a(x_a) + \sum_{j \ne a} r^0_j(x_j)$.

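We are given a distribution $\nu$ on the initial state of the
system, for example $\nu(x_1, \ldots, x_N; s) = \nu_1(x_1) \cdots \nu_N(x_N) \delta_1(s)$
if the initial states of the sites are independent and the
server leaves initially from site 1. We also define the notation
$\sum_{x \in \mathcal{S}} = \sum_{x_1 \in S_1} \cdots \sum_{x_N \in S_N}$. Then the linear program for the
resulting MDP can be written:

maximize

$$\sum_{s=1}^{N} \sum_{x \in \mathcal{S}} \sum_{a=1}^{N} \Big( r^1_a(x_a) + \sum_{j \ne a} r^0_j(x_j) - c_{sa} \Big) \rho_{(x;s),a} \qquad (4)$$

subject to

$$\sum_{a=1}^{N} \rho_{(x;s),a} - \alpha \sum_{s'=1}^{N} \sum_{x' \in \mathcal{S}} \sum_{a=1}^{N} \rho_{(x';s'),a} \mathcal{P}_{(x';s') a (x;s)} = (1-\alpha) \nu(x;s), \quad \forall (x,s) \in \mathcal{S} \times [N],$$

$$\rho_{(x;s),a} \ge 0, \quad \forall ((x;s),a) \in \mathcal{S} \times [N] \times [N],$$

with the decision variables $\rho_{(x;s),a}$ corresponding to the
occupation measures.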

In our case, a non-zero transition probability occurs only
if the target site is the same as our selected action because
we assume that the movement of the server is deterministic.
Using the independence assumption, we obtain the following
equality constraints on the variables:

$$\sum_{a=1}^{N} \rho_{(x;s),a} - \alpha \sum_{s'=1}^{N} \sum_{x' \in \mathcal{S}} \rho_{(x';s'),s} \, p^0_{x'_1 x_1} \cdots p^1_{x'_s x_s} \cdots p^0_{x'_N x_N} = (1-\alpha) \nu(x,s), \quad \forall (x,s) \in \mathcal{S} \times [N]. \qquad (5)$$

In this formulation, our decision variables are $\rho_{(x_1,\ldots,x_N;s),a}$.
Thus, there are $|S_1| \times \cdots \times |S_N| \times N^2$ variables, i.e., a number
exponential in the size of the input data. For example, if we
consider a problem with $N = 10$ sites and 5 states for each
site, we obtain a linear program with more than $976 \times 10^6$
variables, and therefore real world instances of the problem
cannot be solved by feeding this formulation directly into an
LP solver.
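The variable count quoted above is simple arithmetic, sketched here for reference. The second helper counts one variable per (site, site state, server position, action), which is the size of the per-site marginal variables used by the relaxation of Section III; both function names are ours.

```python
from math import prod

def exact_lp_variables(state_counts):
    """|S_1| x ... x |S_N| x N^2 variables in the exact formulation."""
    n = len(state_counts)
    return prod(state_counts) * n * n

def per_site_variables(state_counts):
    """Sum over sites of |S_i| x N^2 variables (per-site marginals)."""
    n = len(state_counts)
    return sum(size * n * n for size in state_counts)

sizes = [5] * 10                      # N = 10 sites, 5 states each
exact = exact_lp_variables(sizes)     # 976_562_500
marginal = per_site_variables(sizes)  # 5_000
print(exact, marginal)
```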

III. Linear Programming Relaxation of the RBSC
A.  Complexity  of  the  RBSC  and  MABSC  problems. 

It  could  well  be  that  the  problem  appears  difficult  to 
solve  only  because  of  our  own  inability  to  formulate  it 
in  an  efficient  manner.  In  the  most  general  formulation 
above,  we  know  however  that  the  problem  is  likely  to  be 
intractable,  since  the  RB  problem  is  already  PSPACE-hard 
[5],  even  for  the  case  of  deterministic  transition  rules  and 
one  server.  Here  we  show  that  the  special  case  MABSC  is 
also  difficult.  In  this  Section,  we  also  denote  (with  a  slight 
abuse  of  notation)  MABSC  as  the  recognition  version  of  the 
optimization  problem  considered  before;  that  is,  given  an 
instance of the MABSC problem and a real number $L$, is
there a policy that achieves a total expected reward greater
than or equal to $L$? This problem is obviously easier than the
full version of the optimization problem. The fact that it is
NP-hard is deduced from the same result for the HAMILTON
CIRCUIT problem.

Theorem 1: MABSC is NP-hard.

Proof: Recall that in the HAMILTON CIRCUIT problem,
we are given a graph $G = (V,E)$, and we want to
know if there is a circuit in $G$ visiting all nodes exactly
once. HAMILTON CIRCUIT was actually one of the first
combinatorial problems proven to be NP-complete [14].
HAMILTON CIRCUIT is a special case of MABSC. Indeed,
given any graph $G = (V,E)$, we construct an instance of
MABSC with $N = |V|$ sites, travel costs $c_{ij} = 1$ if $\{i,j\} \in E$,
and $c_{ij} = 2$ otherwise. We choose arbitrarily one of the sites
to be the site where the server is initially present. This site
has only one state, and the reward for having the server
present at this site at any period is 0. To each of the $(N-1)$
other sites, we associate a Markov chain with two states. To
the first state is associated a reward of 2 (discounted by $\alpha$ at
each time period), and after a visit to a site in this state, the
site moves with probability 1 to the second state, associated
with a reward of $(-1)$ (discounted by $\alpha$ at each time period).
Once in this state, the chain remains there with probability
1.

Now since $\alpha > 0$, it is clear that a policy can achieve a
reward of $1 + \alpha + \cdots + \alpha^{N-2} - \alpha^{N-1}$ if and only if there
exists a Hamilton circuit in $G$. The only possible policy
actually  just  moves  the  server  along  the  Hamilton  circuit 
without  stopping,  except  when  the  server  is  back  at  the  initial 
site.  ■ 
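The construction in the proof can be explored by brute force on small graphs. The sketch below rests on several assumptions of ours: the first move is undiscounted, staying at the initial site after returning costs nothing, only policies that visit every other site exactly once and then return are enumerated, and the target value $1 + \alpha + \cdots + \alpha^{N-2} - \alpha^{N-1}$ is our reading of the reward in the proof. Under these assumptions, a tour reaches the target exactly when all of its $N$ moves use edges of $G$.

```python
from itertools import permutations

def best_tour_reward(n, edges, alpha):
    """Best discounted reward over tours 0 -> (each other site once) -> 0.

    Travel cost is 1 along an edge of the graph and 2 otherwise; a first visit
    to a site other than 0 earns 2, and being at site 0 earns 0.
    """
    edge_set = {frozenset(e) for e in edges}
    best = float("-inf")
    for order in permutations(range(1, n)):
        path = (0,) + order + (0,)
        total = 0.0
        for k in range(n):  # k-th move, discounted by alpha**k
            cost = 1 if frozenset((path[k], path[k + 1])) in edge_set else 2
            reward = 2 if path[k + 1] != 0 else 0
            total += alpha ** k * (reward - cost)
        best = max(best, total)
    return best

def target(n, alpha):
    return sum(alpha ** k for k in range(n - 1)) - alpha ** (n - 1)

cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]  # contains a Hamilton circuit
star = [(0, 1), (0, 2), (0, 3)]           # contains none
print(best_tour_reward(4, cycle, 0.9), best_tour_reward(4, star, 0.9), target(4, 0.9))
```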

Note  that  this  easy  result  is  in  strong  opposition  to  the  case 
without  switching  costs,  where  a  greedy  policy  based  on  the 
Gittins  indices  for  each  site  (computable  in  polynomial  time) 
is  known  to  be  optimal.  However,  the  result  is  not  completely 
satisfying.  It  captures  the  complexity  of  the  combinatorial 
problem  present  in  MABSC  (which  looks  like  a  traveling 
salesman  problem)  but  not  the  complexity  of  the  whole 
scheduling  problem.  Indeed,  even  in  the  case  of  two  sites,  the 
problem  remains  in  general  difficult  (see  for  example  [11]). 

B.  A  Relaxation  for  the  RBSC 

The  discussion  in  the  previous  paragraph  serves  as  a 
justification  for  the  introduction  of  a  relaxed  formulation 
of  the  RBSC  and  MABSC  problems.  The  MAB  framework 
naturally  leads  to  a  Markov  decision  process  for  each  site. 
This  observation  was  already  used  by  Whittle  [4]  in  the  case 
of restless bandits, together with a relaxation of the constraints
tying the projects together. This relaxation allowed
him to decompose the formulation by project ([15] uses the
same idea). Here we essentially extend the method to RBSC.
References for our work include, in particular, the paper by
Bertsimas and Niño-Mora on restless bandits [1].

Consider a policy $u$ and a distribution $\nu$ on the initial
states of the sites; the initial states are assumed independent
(i.e., $\nu(x_1, \ldots, x_N; s) = \nu_1(x_1) \cdots \nu_N(x_N) \delta_1(s)$). These generate
an occupation measure for each site, $f^i_\alpha(\nu,u;(x_i;s),a)$, $i = 1, \ldots, N$,
where

$$f^i_\alpha(\nu,u;(x_i;s),a) = (1-\alpha) E_\nu^u \Big[ \sum_{t=1}^{\infty} \alpha^{t-1} \mathbf{1}\{X^i_t = x_i, S_t = s, A_t = a\} \Big].$$

These occupation measures can be thought of as projections
of the measure for the complete problem [16], or in terms
of probabilities as marginals. Indeed, we have

$$f^i_\alpha(\nu,u;(x_i;s),a) = \sum_{x_1 \in S_1} \cdots \sum_{x_{i-1} \in S_{i-1}} \sum_{x_{i+1} \in S_{i+1}} \cdots \sum_{x_N \in S_N} f_\alpha(\nu,u;(x_1,\ldots,x_N;s),a).$$

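By partial summation over the constraints in (5), one can
see that these measures belong to (and now in general the
inclusion is strict) the polytopes $Q^i_\alpha(\nu_i)$ defined as follows:

$$Q^i_\alpha(\nu_i) = \Big\{ \rho^i \in \mathbb{R}_+^{|S_i| \times N^2} :$$
$$\sum_{a=1}^{N} \rho^i_{(x_i;i),a} - \alpha \sum_{x'_i \in S_i} \sum_{s'=1}^{N} \rho^i_{(x'_i;s'),i} \, p^1_{x'_i x_i} = (1-\alpha) \nu_i(x_i) \delta_1(i), \quad \forall x_i \in S_i,$$
$$\sum_{a=1}^{N} \rho^i_{(x_i;s),a} - \alpha \sum_{x'_i \in S_i} \sum_{s'=1}^{N} \rho^i_{(x'_i;s'),s} \, p^0_{x'_i x_i} = (1-\alpha) \nu_i(x_i) \delta_1(s), \quad \forall x_i \in S_i, \ s \ne i \Big\}. \qquad (6)$$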

For all $N$ sites, we have now a total of $O(N^3 \times \max_i |S_i|)$
variables and constraints, i.e., a number polynomial in the
size  of  the  input,  and  therefore  from  the  discussion  in 
paragraph  III-A  it  is  unlikely  that  these  variables  will  suffice 
to  formulate  the  RBSC  problem  exactly.  However,  we  can 
try  to  reduce  the  size  of  the  feasible  region  spanned  by  the 
new  decision  vectors,  by  finding  additional  constraints  not 
present  in  (6).  Indeed,  the  occupation  measures  for  each  site 
are  projections  of  the  same  original  vector  and  therefore  are 
tied  together. 

We can use the intuitive idea of enforcing constraints on
average to relax a hard problem. At a given time $t$, the server
is switching from one site to exactly one other site, and all
marginal occupation measures should reflect the same change
on the information about the server position. That is, at each
time period $t$, we have:

$$\sum_{x_i \in S_i} \mathbf{1}\{X^i_t = x_i, S_t = s, A_t = a\} = \sum_{x_1 \in S_1} \mathbf{1}\{X^1_t = x_1, S_t = s, A_t = a\}, \quad \forall i, s, a \in [N].$$

We relax these constraints to enforce them only on average,
which implies for the occupation measures

$$\sum_{x_i \in S_i} \rho^i_{(x_i;s),a} = \sum_{x_1 \in S_1} \rho^1_{(x_1;s),a}, \quad \forall i, s, a \in [N]. \qquad (7)$$

XjeSj 

Now  note  that,  in  fact,  this  intuitive  interpretation  leads 
to  an  equation  that  is  clearly  true  by  simply  summing  the 
original  occupation  measure  over  all  states  of  all  sites,  but 
not  over  s  and  a.  The  idea  of  relaxing  a  constraint  enforced 
at  each  time  step  into  a  constraint  enforced  on  average  was 
central  in  the  original  work  of  Whittle  [4].  We  see  here 
that  the  state-action  frequency  approach  allows  us  to  derive 
additional  constraints  between  projects  automatically. 
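That the coupling constraints (7) hold for the marginals of any joint measure can be checked on random data. In this sketch the joint measure is an arbitrary random probability vector over tuples (site states, server position, action), not the occupation measure of an actual policy; the site sizes and the dictionary layout are assumptions made for the example.

```python
import random
from itertools import product

def marginal(joint, site):
    """Sum a joint measure rho[(x, s, a)] over the states of all sites but `site`."""
    out = {}
    for (x, s, a), mass in joint.items():
        key = (x[site], s, a)
        out[key] = out.get(key, 0.0) + mass
    return out

def server_flow(marg, n_sites):
    """Sum a per-site marginal over its own states, leaving a table indexed by (s, a)."""
    flow = {(s, a): 0.0 for s in range(n_sites) for a in range(n_sites)}
    for (_, s, a), mass in marg.items():
        flow[(s, a)] += mass
    return flow

random.seed(0)
sizes = [2, 3, 2]  # hypothetical |S_1|, |S_2|, |S_3|
N = len(sizes)
keys = [(x, s, a) for x in product(*(range(k) for k in sizes))
        for s in range(N) for a in range(N)]
weights = [random.random() for _ in keys]
total = sum(weights)
joint = {k: w / total for k, w in zip(keys, weights)}

# Both sides of (7): the (s, a) table is the same whichever site is used.
flows = [server_flow(marginal(joint, i), N) for i in range(N)]
```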


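Our final linear programming relaxation for RBSC is:

maximize

$$\sum_{s=1}^{N} \sum_{a=1}^{N} \Big[ \sum_{x_a \in S_a} \left( r^1_a(x_a) - c_{sa} \right) \rho^a_{(x_a;s),a} + \sum_{j \ne a} \sum_{x_j \in S_j} r^0_j(x_j) \, \rho^j_{(x_j;s),a} \Big] \qquad (8)$$

subject to

$$\rho^i := \{\rho^i_{(x_i;s),a}\}_{(x_i,s,a)} \in Q^i_\alpha(\nu_i), \quad \forall i \in [N],$$

$$\sum_{x_i \in S_i} \rho^i_{(x_i;s),a} = \sum_{x_1 \in S_1} \rho^1_{(x_1;s),a}, \quad \forall i, s, a \in [N].$$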

As  noted  earlier,  the  number  of  variables  and  constraints 
in  this  program  is  polynomial  in  the  size  of  the  input. 
Computing  the  optimal  value  of  this  linear  program  can 
therefore  be  done  in  polynomial  time,  and  provides  an  upper 
bound  on  the  performance  achievable  by  any  policy  for  the 
original  problem. 

A  few  remarks  can  be  made  about  this  formulation.  First, 
we  could  obtain  tighter  relaxations  by  considering  marginals 
involving several sites at the same time. In the limit case, we
obtain the exact formulation when all sites are considered
simultaneously. This idea is followed in [1]. Let us also mention
that in the original work on restless bandits, Whittle used
Bellman's equation of optimality to formulate the problem.
Additional constraints tying the projects together can then
be handled using the theory of Lagrange multipliers. While
the state-action frequency approach is often preferred to deal with
constrained MDPs, the dynamic programming and Lagrange
multipliers approach has sometimes been used as well (e.g.,
in [15]), since it allows the use of the various algorithms
developed for dynamic programming problems. If the linear
programming method is used to solve the dynamic program,
the corresponding linear program is the dual of the one
obtained using occupation measures. We can of course obtain
the dual of (8) directly:

$$\text{minimize} \quad (1-\alpha) \sum_{i=1}^{N} \sum_{s=1}^{N} \sum_{x_i \in S_i} \nu_i(x_i) \, \delta_1(s) \, \lambda^i_{x_i,s} \qquad (9)$$
IV. Heuristics for the RBSC Problem


We  now  present  algorithms  to  solve  the  RBSC  problem  in 
practice.  On  specific  examples,  when  the  optimal  solution  is 
too  costly  to  compute,  the  relaxation  presented  in  section  III 
provides  an  upper  bound  on  the  achievable  performance. 
Based  on  ideas  developed  in  [1],  we  present  a  primal-dual 
heuristic  obtained  from  the  linear  programming  relaxation. 
We  then  show  that  this  heuristic  can  be  viewed  as  a  one- 
step  lookahead  policy  [17].  Experimentally,  this  heuristic 
performed  well  over  a  wide  range  of  problems. 


A.  A  Simple  Greedy  Algorithm 

Perhaps the simplest policy for the RBSC problem is
the following greedy policy: in state $(x_1, \ldots, x_N; s)$, send the
server to the site that maximizes the marginal instantaneous
reward, i.e., $a_{greedy}(x_1, \ldots, x_N; s) \in \arg\max_a \{ r^1_a(x_a) - r^0_a(x_a) - c_{sa} \}$. This
policy is optimal in the case of the MAB problem, with
no transition costs, and deteriorating rewards (i.e., projects
become less profitable as they are worked on) [3]. It is also
a one-step lookahead policy where we approximate the cost-
to-go by 0 (see (12)).
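A minimal sketch of the greedy rule follows, assuming rewards and costs are stored as nested lists (`r_active[a][state]`, `r_passive[a][state]`, `c[s][a]`); this layout and the numbers in the example are hypothetical. Ties are broken in favor of the lowest site index, which is a choice of this sketch.

```python
def greedy_action(x, s, r_active, r_passive, c):
    """Site maximizing r^1_a(x_a) - r^0_a(x_a) - c_{sa}.

    x[i] is the state of site i and s is the current server position.
    """
    n = len(x)
    return max(range(n), key=lambda a: r_active[a][x[a]] - r_passive[a][x[a]] - c[s][a])

# Hypothetical 3-site instance with 2 states per site.
r_active = [[5.0, 1.0], [4.0, 2.0], [9.0, 0.0]]
r_passive = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
c = [[0.0, 1.0, 6.0],
     [1.0, 0.0, 2.0],
     [6.0, 2.0, 0.0]]

from_site_0 = greedy_action((0, 0, 0), 0, r_active, r_passive, c)  # site 2 is too far
from_site_1 = greedy_action((0, 0, 0), 1, r_active, r_passive, c)  # site 2 is now close
print(from_site_0, from_site_1)
```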

B.  Reconstructing  the  Original  Occupation  Measures 

Solving the linear program (8) provides values for the
marginals of the original occupation measures. Moreover,
each sum in (7) can be interpreted as the probability for the
server to be at site $s$ and switch to site $a$. Then, one can try
to reconstruct occupation measures for the original problem,
for example by defining

$$\rho_{(x_1,\ldots,x_N;s),a} = \frac{1}{(\rho_{(s,a)})^{N-1}} \prod_{i=1}^{N} \rho^i_{(x_i;s),a},$$

with $\rho_{(s,a)} = \sum_{x_1 \in S_1} \rho^1_{(x_1;s),a}$. Then from (7), we immediately
have $\sum_{j \ne i} \sum_{x_j \in S_j} \rho_{(x_1,\ldots,x_N;s),a} = \rho^i_{(x_i;s),a}$. The algorithm
simply chooses, in state $(x_1, \ldots, x_N; s)$, action $a$ with probability
$\rho_{(x_1,\ldots,x_N;s),a} / \sum_{a'=1}^{N} \rho_{(x_1,\ldots,x_N;s),a'}$ if the denominator is
non-zero, or leaves the server at its current position otherwise.
Of course, these new variables $\rho$ do not satisfy (4)
in general, and it is difficult to justify this construction
theoretically.
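The product-form reconstruction can be sketched for a fixed pair $(s,a)$ as follows. We assume the per-site marginals are given as lists `marginals[i][x_i]` that share a common total, as required by (7), and we assume a product of marginals normalized by the $(N-1)$-th power of that total; the numbers are hypothetical. Summing the reconstructed measure over the other sites then returns each marginal.

```python
from itertools import product

def reconstruct(marginals, sizes):
    """Product-form joint measure for one fixed (s, a) pair.

    marginals[i][x_i] plays the role of rho^i_{(x_i; s), a}; by (7) every list
    has the same total, which plays the role of rho_(s, a).
    """
    n = len(sizes)
    flow = sum(marginals[0])
    states = list(product(*(range(k) for k in sizes)))
    if flow == 0.0:
        return {x: 0.0 for x in states}
    joint = {}
    for x in states:
        mass = 1.0
        for i in range(n):
            mass *= marginals[i][x[i]]
        joint[x] = mass / flow ** (n - 1)
    return joint

sizes = [2, 2, 2]
marginals = [[0.1, 0.3], [0.2, 0.2], [0.4, 0.0]]  # each list sums to 0.4
joint = reconstruct(marginals, sizes)

def site_marginal(joint, i, value):
    return sum(mass for x, mass in joint.items() if x[i] == value)
```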

C.  A  Primal-Dual  Index  Heuristic 

1) Construction from the LP relaxation: For the linear
programs (8) and (9), we can obtain optimal primal and
dual solutions $\{\bar\rho^i_{(x_i;s),a}\}$ and $\{\bar\lambda^i_{x_i,s}\}$. We call $\{\bar\gamma^i_{(x_i;s),a}\}$
the corresponding reduced costs, which are given by the
difference between the left and right hand side in the
constraints of (9). These reduced costs are nonnegative,
there is one such coefficient corresponding to each variable
$\rho^i_{(x_i;s),a}$ of the primal, and additionally by complementary
slackness $\bar\gamma^i_{(x_i;s),a} = 0$ if $\bar\rho^i_{(x_i;s),a} > 0$. Bertsimas and
Niño-Mora motivated their heuristic for the RB problem using the
following interpretation of the reduced costs: starting from
an optimal solution, $\bar\gamma^i_{(x_i;s),a}$ is the rate of decrease in the
objective value of the primal linear program (8) per unit
increase in the value of the variable $\rho^i_{(x_i;s),a}$.

We use this interpretation and the intuitive idea that taking
action $a$ when project $i$ is in state $x_i$ and the server at $s$ implies
in some sense increasing the value of $\rho^i_{(x_i;s),a}$. In particular,
we would like to keep the quantities $\rho^i_{(x_i;s),a}$ found to be 0 in
the relaxation as close to 0 as possible in the final solution.
Since for these variables we can have $\bar\gamma^i_{(x_i;s),a} > 0$, when the
system is in state $(x_1, \ldots, x_N; s)$, we associate to each action
$a$ an index of undesirability

$$I((x_1,\ldots,x_N;s),a) = \sum_{i=1}^{N} \bar\gamma^i_{(x_i;s),a} = \bar\gamma^1_{(x_1;s),a} + \cdots + \bar\gamma^N_{(x_N;s),a}, \qquad (10)$$

that is, we sum the reduced costs for the $N$ different projects.
Then we select action $a_{pd}$ that minimizes these indices:

$$a_{pd}(x_1,\ldots,x_N;s) \in \arg\min_a \{ I((x_1,\ldots,x_N;s),a) \}. \qquad (11)$$

Note that this procedure always provides a deterministic
policy.
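Once the reduced costs are available, selecting an action by (10) and (11) is a table lookup and a sum. The sketch below assumes the reduced costs have already been computed by an LP solver and are stored per site in dictionaries keyed by `(x_i, s, a)`; the storage layout and the numbers are hypothetical.

```python
def pd_action(x, s, gamma):
    """Action minimizing the index of undesirability (10).

    gamma[i][(x_i, s, a)] is the optimal reduced cost attached to the
    primal variable of site i in state x_i, server at s, action a.
    """
    n = len(x)
    index = [sum(gamma[i][(x[i], s, a)] for i in range(n)) for a in range(n)]
    best = min(range(n), key=index.__getitem__)
    return best, index

# Hypothetical reduced costs for 2 sites, shown only for state 0 and server at 0.
gamma = [
    {(0, 0, 0): 0.0, (0, 0, 1): 0.5},
    {(0, 0, 0): 0.2, (0, 0, 1): 0.1},
]
action, index = pd_action((0, 0), 0, gamma)
print(action, index)
```

Ties are broken in favor of the lowest site index here, which makes the resulting policy deterministic.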

2) Interpretation as a One-Step Lookahead Policy: There
is an interesting alternate way of viewing the primal-dual
heuristic from a dynamic programming point of view.
For state $(x_1, \ldots, x_N; s)$, we form an approximation of the
(infinite-horizon) cost-to-go as: $\tilde{J}(x_1, \ldots, x_N; s) = \sum_{i=1}^{N} \bar\lambda^i_{x_i,s}$.

By summing over the constraints in (9), one can readily
see that the vector $\tilde{J}$ is a feasible vector for the linear
program obtained from Bellman's equation for the original
problem, or alternatively as the dual of (4). That is, $\tilde{J}$ is a
superharmonic vector, and we recall that the exact optimal
cost is the smallest superharmonic vector [18]. By obtaining
a tight relaxation, including the additional constraints on the
marginals, we can obtain a vector $\tilde{J}$ which is closer to the
optimal cost-to-go vector. Now the one-step lookahead policy
is obtained by maximizing over $a$

$$r^1_a(x_a) + \sum_{j \ne a} r^0_j(x_j) - c_{sa} + \alpha \Big( \sum_{x'_a \in S_a} p^1_{x_a x'_a} \bar\lambda^a_{x'_a,a} + \sum_{j \ne a} \sum_{x'_j \in S_j} p^0_{x_j x'_j} \bar\lambda^j_{x'_j,a} \Big).$$

It is a straightforward calculation, using the definition of
the reduced costs given above, to verify that this maximization
is equivalent to the minimization in (11).

3) Computational Aspects and Approximate Indices: The
one-step lookahead policy above can be computed in real
time in practice. Note that we can rewrite
$\sum_{j \ne a} r^0_j(x_j) = -r^0_a(x_a) + \sum_{j=1}^{N} r^0_j(x_j)$, and the sum does not depend on $a$.
Therefore the one-step lookahead heuristic is implemented
on-line by choosing in state $(x_1, \ldots, x_N; s)$ the action that
maximizes the approximate indices

$$a_{pd} \in \arg\max_a \{ m_{x,s}(a) \}, \qquad (12)$$

$$m_{x,s}(a) = r^1_a(x_a) - r^0_a(x_a) - c_{sa} + \alpha \Big( \sum_{x'_a \in S_a} p^1_{x_a x'_a} \bar\lambda^a_{x'_a,a} + \sum_{j \ne a} \sum_{x'_j \in S_j} p^0_{x_j x'_j} \bar\lambda^j_{x'_j,a} \Big).$$

This expression motivates the definition of the simple greedy
policy given previously. We can store the $O(N^2 \max_i(|S_i|))$
optimal dual variables $\bar\lambda^i_{x_i,s}$ and compute these indices in time
$O(N^2 \max_i(|S_i|))$. Equivalently, as we have seen above, we
can store the $O(N^3 \max_i(|S_i|))$ optimal reduced costs and
compute the indices appearing in (11) in time $O(N^2)$.
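The on-line index computation can be sketched as below, under the assumption that the approximate cost-to-go is separable, i.e., a sum over sites of per-site terms `lam[i][state][server_position]` obtained from the dual solution. The data layout and all numbers are hypothetical. The second function computes the full one-step lookahead value by brute force over joint next states, so that one can check that it differs from the index only by the term $\sum_j r^0_j(x_j)$, which does not depend on the action.

```python
from itertools import product

def lookahead_index(x, s, a, r1, r0, c, p1, p0, lam, alpha):
    """Per-action index: lookahead value minus the a-independent sum of passive rewards."""
    future = 0.0
    for i in range(len(x)):
        p = p1[i] if i == a else p0[i]  # only site a evolves with its active dynamics
        future += sum(p[x[i]][y] * lam[i][y][a] for y in range(len(lam[i])))
    return r1[a][x[a]] - r0[a][x[a]] - c[s][a] + alpha * future

def full_lookahead(x, s, a, r1, r0, c, p1, p0, lam, alpha):
    """One-step lookahead value, enumerating joint next states explicitly."""
    n = len(x)
    reward = r1[a][x[a]] + sum(r0[j][x[j]] for j in range(n) if j != a) - c[s][a]
    expected = 0.0
    for y in product(*(range(len(lam[i])) for i in range(n))):
        prob = 1.0
        for i in range(n):
            prob *= (p1[i] if i == a else p0[i])[x[i]][y[i]]
        expected += prob * sum(lam[i][y[i]][a] for i in range(n))
    return reward + alpha * expected

# Hypothetical 2-site instance with 2 states per site.
r1 = [[3.0, 1.0], [2.0, 0.5]]
r0 = [[0.2, 0.1], [0.0, -0.3]]
c = [[0.0, 1.0], [1.5, 0.0]]
p1 = [[[0.3, 0.7], [0.0, 1.0]], [[0.6, 0.4], [0.1, 0.9]]]
p0 = [[[1.0, 0.0], [0.2, 0.8]], [[0.9, 0.1], [0.0, 1.0]]]
lam = [[[4.0, 3.5], [2.0, 1.0]], [[5.0, 4.0], [2.5, 3.0]]]
alpha = 0.9
```

The index needs only sums over individual sites, whereas the brute-force version enumerates the product state space; this is the practical benefit of the decomposition by project.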


V.  Numerical  Experiments 

TABLE I
Numerical Experiments

Problem      α     Z*      Z'       Z_greedy   Z      Z_pd
Problem 1    0.9   96.4    109.4    96         94     70
Problem 2    0.9   -       281.7    246        186    213
Problem 3    0.9   -       195.3    82         96     107
Problem 4    0.9   -       1799.4   1680       1450   1713
Problem 5    0.9   -       1121.5   985        648    943

We now consider problems whose characteristics differently
affect the performance of the heuristics presented in
section IV. We also compute the optimal performance as the
solution of the linear program (4) when allowed by the size
of the state space, and an upper bound on this performance
using the relaxation (8) in any case. Linear programs are
implemented in AMPL and solved using CPLEX. Due to
the size of the state space, the expected discounted reward
of the various heuristics is computed using Monte-Carlo
simulations. The computation of each trajectory is terminated
after  a  sufficiently  large,  but  finite  horizon;  in  our  case, 
when $\alpha^t$ times the maximal absolute value of any immediate
reward  becomes  less  than  10^®.  The  server  is  assumed  to 
start  from  site  1.  To  reduce  the  amount  of  computation  in 
the  evaluation  of  the  policies,  we  assume  that  the  distribution 
of  the  initial  states  is  deterministic. 
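The stopping rule for each simulated trajectory can be sketched as follows. The tolerance `tol` and the reward bound used in the example are illustrative values of ours, not the settings used in the experiments.

```python
def truncation_horizon(alpha, r_max, tol):
    """Smallest t such that alpha**t * r_max < tol (requires 0 < alpha < 1, tol > 0)."""
    t = 0
    while alpha ** t * r_max >= tol:
        t += 1
    return t

# Illustrative values only.
horizon = truncation_horizon(alpha=0.9, r_max=10.0, tol=1e-6)
print(horizon)
```

Rewards collected after this horizon are each scaled by less than `tol / r_max`, so truncating the trajectory there changes the estimated discounted reward only marginally.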

We adopt the following nomenclature:

• Z*: Optimal value of the problem (when available).

• Z': Optimal value of the relaxation.

• Z_greedy: Estimated expected value of the greedy heuristic.

• Z: Estimated expected value of the heuristic reconstructing
occupation measures (Section IV-B).

• Z_pd: Estimated expected value of the primal-dual index
heuristic (i.e., one-step lookahead).

Problem  1  is  a  2-armed  bandit  (i.e.,  passive  projects  are 
frozen,  passive  rewards  are  0,  and  there  are  no  switching 
costs),  with  10  states  for  each  project  and  deteriorating 
rewards.  Therefore  the  greedy  heuristic  performs  optimally. 
Problem  2  is  an  MAB  problem  with  8  projects,  10  states  per 
project  and  deteriorating  rewards.  Z*  was  not  computed  but 
can  be  obtained  from  the  simulation  of  the  greedy  policy, 
since  we  know  that  it  is  optimal.  Problem  3  is  a  MABSC 
problem, with 5 projects, 10 states per project and deteriorating
rewards. One of the projects has a high immediate reward
at  the  beginning,  but  is  relatively  remote,  so  we  expected 
the  greedy  policy  to  perform  poorly.  Problems  4  and  5  were 
RBSC  problems  with  randomly  generated  data. 

The  results  of  the  numerical  experiments  on  the  different 
problems  are  given  in  table  I.  Over  the  range  of  problems,  the 
primal-dual  index  heuristic  performs  reasonably  well.  Note 
that  since  we  do  not  assume  the  distances  to  be  symmetric, 
it  is  very  easy  to  design  a  problem  where  the  greedy  policy 
performs  badly  by  associating  to  a  site  a  high  immediate 
reward  at  the  beginning,  but  also  a  high  switching  cost  to 
leave  this  site.  The  gap  between  the  original  optimal  value 
and  the  relaxation  is  seen  to  be  less  than  15%  in  all  examples, 
except  possibly  in  problem  3  where  none  of  the  policies 
approaches  the  bound.  It  is  not  clear  at  this  point  which 
factors  make  the  primal-dual  index  heuristic  underperform. 

VI.  Conclusions 

In the spirit of existing work on the restless bandits problem,
we have presented a linear programming relaxation for
the  restless  bandits  with  switching  costs  problem.  This  set-up 


is  quite  powerful  to  model  a  wide  range  of  dynamic  resource 
allocation  problems.  Since  the  problem  is  computationally 
intractable  and  an  optimal  solution  can  in  general  not  be 
directly  computed,  the  relaxation  is  useful  in  providing  a 
bound  on  the  achievable  performance.  We  also  presented 
several  heuristics  to  solve  the  problem  in  practice,  as  well 
as  their  performances  on  specific  examples.  An  interesting 
part  of  the  analysis  showed  the  link  between  the  linear 
programming  relaxation  and  a  standard  suboptimal  control 
method.  We  also  showed  that  the  relaxation  can  be  obtained 
automatically  by  first  formulating  the  original  problem  as  a 
Markov  decision  process  on  the  whole  state  space  and  then 
considering  specific  marginals  of  the  occupation  measure. 
Future work will focus on trying to design, for special
cases,  approximation  algorithms  with  a  guaranteed  worst- 
case  performance. 